Joint-Attention Learning in Prosody Transfer Speech Synthesis

Demo

1. Token factorization

Assigning different tokens yields different speaker voices in the synthesis results

Figure 1. Spectrograms of the speech synthesized with each of the 5 tokens, using a model trained on the VCTK dataset. Tokens 1 to 5 are shown from top to bottom.


Utterance: "I’ve felt the chance that I have a number of options."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 
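For readers who want to reproduce this kind of per-token comparison, the following is a minimal sketch of computing one log-magnitude spectrogram per token output with NumPy. It is not the code used to produce the figure: the function `log_spectrogram`, the parameters `n_fft` and `hop`, the 16 kHz sample rate, and the sine-tone stand-in signals are all assumptions made for illustration. In practice the five stand-in signals would be replaced by the five synthesized waveforms.

```python
import numpy as np

def log_spectrogram(wave, n_fft=256, hop=64):
    """Return a (frames, n_fft // 2 + 1) log-magnitude spectrogram."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) in silent bins.
    return np.log(magnitude + 1e-6)

# Stand-in signals (assumption): five sine tones, NOT real model output.
sample_rate = 16000
t = np.arange(sample_rate // 4) / sample_rate
token_waves = [np.sin(2 * np.pi * f0 * t) for f0 in (125, 250, 500, 1000, 2000)]

# One spectrogram per token, kept in token order (1 to 5).
token_specs = [log_spectrogram(w) for w in token_waves]

# Frequency bin with the most average energy, as a crude per-token summary.
peak_bins = [int(spec.mean(axis=0).argmax()) for spec in token_specs]
```

Each array in `token_specs` can then be drawn as one row of a stacked figure, for example with matplotlib's `imshow`, so that token 1 is at the top and token 5 at the bottom.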

Assigning different tokens yields different synthesis results

Figure 2. Spectrograms of the speech synthesized with each of the 5 tokens, using a model trained on an internal dataset. Tokens 1 to 5 are shown from top to bottom.


Utterance: "Just recovered a fumble on ensuing kickoff."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

Token factorization on Blizzard 2013 dataset

Figure 3. Spectrograms of the speech synthesized with each of the 5 tokens, using a model trained on the Blizzard 2013 dataset. Tokens 1 to 5 are shown from top to bottom.


Utterance: "Just recovered a fumble on ensuing kickoff."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

2. Prosody Transfer

Parallel utterances


The following shows three examples of prosody transfer synthesis.

In each example, the text to synthesize is the same as the text of the reference utterance. The first utterance in each example is the reference. The second is the synthesis result with neutral prosody. The third is the prosody transfer result.

Example 1

Utterance text content: My mother always took him to the town on a market day in a light gig.

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

Example 2

Utterance text content: So we never saw Dick any more.

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

Example 3

Utterance text content: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

Unparallel utterances

The following shows three examples of unparallel prosody transfer synthesis.

In each example, the text to synthesize is different from the text of the reference utterance. The first utterance in each example is the reference. The second and third are two prosody transfer results, each synthesized from a different text.

Example 1

Reference text: My mother always took him to the town on a market day in a light gig.

Prosody Transfer result 1's text: So we never saw Dick any more.

Prosody Transfer result 2's text: Just recovered a fumble on ensuing kickoff.

The prosody of the unparallel reference utterance is transferred to the synthesis results, which have different text content.

Reference utterance: 
Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 1: 
Text: So we never saw Dick any more.


Prosody Transfer text 2: 
Text: Just recovered a fumble on ensuing kickoff.


Example 2

Reference text: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?

Prosody Transfer result 1's text: My mother always took him to the town on a market day in a light gig.

Prosody Transfer result 2's text: There was nothing disagreeable in Mister Rushworth's appearance.

The prosody of the unparallel reference utterance is transferred to the synthesis results, which have different text content.

Reference utterance:
Text: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?


Prosody Transfer text 1:
Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 2:
Text: There was nothing disagreeable in Mister Rushworth's appearance.


Example 3

Reference text: There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.

Prosody Transfer result 1's text: Just recovered a fumble on ensuing kickoff.

Prosody Transfer result 2's text: My mother always took him to the town on a market day in a light gig.

The prosody of the unparallel reference utterance is transferred to the synthesis results, which have different text content.

Reference utterance:
Text: There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.


Prosody Transfer text 1:
Text: Just recovered a fumble on ensuing kickoff.


Prosody Transfer text 2:
Text: My mother always took him to the town on a market day in a light gig.